AITopics | adaptive gradient algorithm

adaptive gradient algorithm

On the SDEs and Scaling Rules for Adaptive Gradient Algorithms

Neural Information Processing Systems | Dec-24-2025, 00:12:57 GMT

Approximating Stochastic Gradient Descent (SGD) as a Stochastic Differential Equation (SDE) has allowed researchers to enjoy the benefits of studying a continuous optimization trajectory while carefully preserving the stochasticity of SGD. Analogous study of adaptive gradient methods, such as RMSprop and Adam, has been challenging because there were no rigorously proven SDE approximations for these methods. This paper derives the SDE approximations for RMSprop and Adam, giving theoretical guarantees of their correctness as well as experimental validation of their applicability to common large-scale vision and language settings. A key practical result is the derivation of a square root scaling rule to adjust the optimization hyperparameters of RMSprop and Adam when changing batch size, and its empirical validation in deep learning settings.
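The square root scaling rule named in this abstract can be sketched in a few lines. This is an illustration under stated assumptions, not the paper's code: the abstract says only that a square root rule adjusts the hyperparameters of RMSprop and Adam when the batch size changes. The sketch assumes the commonly cited form for Adam, in which multiplying the batch size by kappa multiplies the learning rate by sqrt(kappa), multiplies 1 - beta1 and 1 - beta2 by kappa, and divides epsilon by sqrt(kappa). The names AdamHParams and sqrt_scale are hypothetical, and the exact rule should be checked against the paper before use.

```python
import math
from dataclasses import dataclass


@dataclass(frozen=True)
class AdamHParams:
    """Hypothetical container for Adam hyperparameters."""
    lr: float
    beta1: float
    beta2: float
    eps: float


def sqrt_scale(h: AdamHParams, kappa: float) -> AdamHParams:
    """Rescale Adam hyperparameters when the batch size is multiplied by kappa.

    Assumed rule (not stated in the abstract):
      lr       -> lr * sqrt(kappa)
      1 - beta -> kappa * (1 - beta)   for beta1 and beta2
      eps      -> eps / sqrt(kappa)
    """
    if kappa <= 0:
        raise ValueError("kappa must be positive")
    beta1 = 1.0 - kappa * (1.0 - h.beta1)
    beta2 = 1.0 - kappa * (1.0 - h.beta2)
    # Large kappa can push the betas out of [0, 1); refuse rather than clamp.
    if not (0.0 <= beta1 < 1.0 and 0.0 <= beta2 < 1.0):
        raise ValueError("kappa is too large for these beta values")
    root = math.sqrt(kappa)
    return AdamHParams(lr=h.lr * root, beta1=beta1, beta2=beta2, eps=h.eps / root)


if __name__ == "__main__":
    base = AdamHParams(lr=1e-3, beta1=0.9, beta2=0.999, eps=1e-8)
    # Batch size multiplied by 4 (for example 256 -> 1024).
    print(sqrt_scale(base, 4.0))
```

Under this assumed rule the betas shrink as the batch grows, so very large kappa values are rejected rather than silently clamped.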

adaptive gradient algorithm, name change, sde and scaling rule, (6 more...)

Neural Information Processing Systems

Technology: Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Gradient Descent (0.62)

Towards Theoretically Understanding Why SGD Generalizes Better Than ADAM in Deep Learning

Pan Zhou

Neural Information Processing Systems | Aug-17-2025, 06:40:52 GMT

In this work, we provide a new viewpoint for understanding the generalization performance gap.

artificial intelligence, deep learning, machine learning, (15 more...)

Neural Information Processing Systems

Country:

North America > Canada (0.04)
Asia > Singapore (0.04)

Genre: Research Report > New Finding (0.94)

Technology: Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (1.00)

Reviews: Necessary and Sufficient Geometries for Gradient Methods

Neural Information Processing Systems | Jan-26-2025, 21:01:34 GMT

POST-AUTHOR FEEDBACK: I thank the authors for their explanations, which address my concerns. I look forward to seeing the camera-ready version with improved results and the clarifications. The paper is well-written, the technical results are precise and correct as far as I can tell, and the results are novel and interesting. However, I am slightly puzzled over a few points, some of which pertain to the interpretation of the results. I believe the paper would significantly benefit from clarifying the following points:
* The authors refer to mirror descent with (fixed) weighted Euclidean distance as an adaptive gradient algorithm, but this connection is not clear to me.

gradient algorithm, gradient method, necessary and sufficient geometry, (8 more...)

Neural Information Processing Systems

Technology: Information Technology > Artificial Intelligence (0.41)

On the SDEs and Scaling Rules for Adaptive Gradient Algorithms

Neural Information Processing Systems | Oct-10-2024, 14:10:05 GMT

adaptive gradient algorithm, rmsprop and adam, sde and scaling rule, (2 more...)

Neural Information Processing Systems

Technology: Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Gradient Descent (0.67)

Regularized Gradient Clipping Provably Trains Wide and Deep Neural Networks

Tucat, Matteo, Mukherjee, Anirbit

arXiv.org Artificial Intelligence | Apr-12-2024

In this work, we instantiate a regularized form of the gradient clipping algorithm and prove that it can converge to the global minima of deep neural network loss functions provided that the net is of sufficient width. We present empirical evidence that our theoretically founded regularized gradient clipping algorithm is also competitive with state-of-the-art deep-learning heuristics. Hence the algorithm presented here constitutes a new approach to rigorous deep learning. The modification we make to standard gradient clipping is designed to leverage the PL* condition, a variant of the Polyak-Łojasiewicz inequality that was recently proven (Liu et al., 2020) to hold for various neural networks of any depth within a neighbourhood of the initialisation.

In various disciplines, ranging from control theory to machine learning theory, there has been a long history of trying to understand the nature of convergence on non-convex objectives for first-order optimization algorithms, i.e., algorithms that only have access to (an estimate of) the gradient of the objective (Maryak & Chin, 2001; Fang et al., 1997).
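To make the idea of regularizing gradient clipping concrete, here is a minimal sketch. The abstract does not give the update rule, so the form used here is an assumption: standard clipping scales the step by min(1, gamma / ||g||), and the sketch adds a floor delta on that factor so the effective step size never collapses to zero when the gradient is very large. The function name reg_clip_step and the parameters eta, gamma, and delta are illustrative only.

```python
import math
from typing import List


def reg_clip_step(w: List[float], grad: List[float],
                  eta: float, gamma: float, delta: float) -> List[float]:
    """One step of gradient descent with a floored clipping factor.

    Assumed form (not taken from the abstract):
        factor = min(1, max(delta, gamma / ||grad||))
        w_new  = w - eta * factor * grad
    With delta = 0 this reduces to standard gradient-norm clipping.
    """
    norm = math.sqrt(sum(g * g for g in grad))
    if norm == 0.0:
        return list(w)  # stationary point: nothing to do
    factor = min(1.0, max(delta, gamma / norm))
    return [wi - eta * factor * gi for wi, gi in zip(w, grad)]


if __name__ == "__main__":
    w0 = [0.0, 0.0]
    big_grad = [3.0, 4.0]  # norm 5
    print(reg_clip_step(w0, big_grad, eta=0.1, gamma=1.0, delta=0.0))  # plain clipping
    print(reg_clip_step(w0, big_grad, eta=0.1, gamma=1.0, delta=0.5))  # floored factor
```

The floor means the update stays proportional to the gradient for large gradients, which is the kind of property a gradient-domination (PL-type) argument can exploit. Whether this matches the paper's exact construction should be checked against the paper itself.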

algorithm, convergence, gradient, (16 more...)

arXiv.org Artificial Intelligence

2404.08624

Country:

North America > Canada > Ontario > Toronto (0.14)
North America > United States > Georgia > Fulton County > Atlanta (0.04)
North America > United States > California > Los Angeles County > Long Beach (0.04)
Europe > Czechia > Prague (0.04)

Genre:

Research Report (0.51)
Overview (0.46)

Technology: Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (1.00)

A Control Theoretic Framework for Adaptive Gradient Optimizers in Machine Learning

Chakrabarti, Kushal, Chopra, Nikhil

arXiv.org Artificial Intelligence | Aug-19-2023

Adaptive gradient methods have become popular in optimizing deep neural networks; recent examples include AdaGrad and Adam. Although Adam usually converges faster, variations of Adam, for instance, the AdaBelief algorithm, have been proposed to enhance Adam's poor generalization ability compared to the classical stochastic gradient method. This paper develops a generic framework for adaptive gradient methods that solve non-convex optimization problems. We first model the adaptive gradient methods in a state-space framework, which allows us to present simpler convergence proofs of adaptive optimizers such as AdaGrad, Adam, and AdaBelief. We then utilize the transfer function paradigm from classical control theory to propose a new variant of Adam, coined AdamSSM. We add an appropriate pole-zero pair in the transfer function from squared gradients to the second moment estimate. We prove the convergence of the proposed AdamSSM algorithm. Applications to benchmark machine learning tasks, image classification using CNN architectures and language modeling using an LSTM architecture, demonstrate that the AdamSSM algorithm balances generalization accuracy and convergence speed better than recent adaptive gradient methods.
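The state-space view mentioned in this abstract can be illustrated on Adam's second-moment estimate alone. The sketch below is not the paper's model: it only shows that the familiar exponential moving average v <- beta2 * v + (1 - beta2) * g^2 is a one-state linear system driven by the squared gradient, with a single pole at beta2 and unit DC gain. The function names are hypothetical, and the extra pole-zero pair of AdamSSM is deliberately not reproduced because the abstract does not specify it.

```python
from typing import List


def ema_second_moment(grads: List[float], beta2: float) -> List[float]:
    """Adam-style second-moment estimate (no bias correction)."""
    v, out = 0.0, []
    for g in grads:
        v = beta2 * v + (1.0 - beta2) * g * g
        out.append(v)
    return out


def state_space_response(inputs: List[float], a: float, b: float,
                         c: float = 1.0, x0: float = 0.0) -> List[float]:
    """Simulate the scalar linear system x <- a*x + b*u, y = c*x."""
    x, out = x0, []
    for u in inputs:
        x = a * x + b * u
        out.append(c * x)
    return out


def dc_gain(a: float, b: float, c: float = 1.0) -> float:
    """Steady-state output for a constant unit input (requires |a| < 1)."""
    return c * b / (1.0 - a)


if __name__ == "__main__":
    grads = [1.0, -2.0, 0.5, 3.0]
    beta2 = 0.9
    squared = [g * g for g in grads]
    print(ema_second_moment(grads, beta2))
    print(state_space_response(squared, a=beta2, b=1.0 - beta2))
    print(dc_gain(beta2, 1.0 - beta2))
```

Seen this way, changing the filter from squared gradients to the second-moment estimate amounts to choosing different system coefficients, which is the design space the transfer-function argument explores.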

algorithm, artificial intelligence, machine learning, (15 more...)

arXiv.org Artificial Intelligence

2206.02034

Country:

Asia > Middle East > Jordan (0.04)
North America > United States > New York > New York County > New York City (0.04)
North America > United States > Maryland > Prince George's County > College Park (0.04)
(2 more...)

Genre: Research Report (0.64)

Technology: Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (1.00)

Adan: Adaptive Nesterov Momentum Algorithm for Faster Optimizing Deep Models

Xie, Xingyu, Zhou, Pan, Li, Huan, Lin, Zhouchen, Yan, Shuicheng

arXiv.org Artificial Intelligence | Feb-27-2023

In deep learning, different kinds of deep networks typically need different optimizers, which have to be chosen after multiple trials, making the training process inefficient. To relieve this issue and consistently improve the model training speed across deep networks, we propose the ADAptive Nesterov momentum algorithm, Adan for short. Adan first reformulates the vanilla Nesterov acceleration to develop a new Nesterov momentum estimation (NME) method, which avoids the extra overhead of computing the gradient at the extrapolation point. Then Adan adopts NME to estimate the gradient's first- and second-order moments in adaptive gradient algorithms for convergence acceleration. Besides, we prove that Adan finds an $\epsilon$-approximate first-order stationary point within $O(\epsilon^{-3.5})$ stochastic gradient complexity on non-convex stochastic problems (e.g., deep learning problems), matching the best-known lower bound. Extensive experimental results show that Adan consistently surpasses the corresponding SoTA optimizers on vision, language, and RL tasks and sets new SoTAs for many popular networks and frameworks, e.g., ResNet, ConvNext, ViT, Swin, MAE, DETR, GPT-2, Transformer-XL, and BERT. More surprisingly, Adan can use half of the training cost (epochs) of SoTA optimizers to achieve higher or comparable performance on ViT, GPT-2, MAE, etc., and also shows great tolerance to a large range of minibatch sizes, e.g., from 1k to 32k. Code is released at https://github.com/sail-sg/Adan, and has been used in multiple popular deep learning frameworks or projects.

large language model, machine learning, natural language, (18 more...)

arXiv.org Artificial Intelligence

2208.06677

Country:

Europe > Russia (0.04)
Asia > Russia (0.04)
Asia > Myanmar > Tanintharyi Region > Dawei (0.04)
Asia > Middle East > Jordan (0.04)

Genre: Research Report > New Finding (0.66)

Industry: Education (0.87)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (1.00)

A Qualitative Study of the Dynamic Behavior of Adaptive Gradient Algorithms

Ma, Chao, Wu, Lei, E, Weinan

arXiv.org Machine Learning | Sep-13-2020

The dynamic behavior of the RMSprop and Adam algorithms is studied through a combination of careful numerical experiments and theoretical explanations. Three types of qualitative features are observed in the training loss curve: fast initial convergence, oscillations, and large spikes. The sign gradient descent (signGD) algorithm, which is the limit of Adam when taking the learning rate to $0$ while keeping the momentum parameters fixed, is used to explain the fast initial convergence. For the late phase of Adam, three different types of qualitative patterns are observed depending on the choice of the hyper-parameters: oscillations, spikes, and divergence. In particular, Adam converges faster and more smoothly when the values of the two momentum factors are close to each other.
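The signGD limit described in this abstract can be checked numerically on a toy problem. The sketch below is an illustration, not the paper's experiment: it runs a textbook scalar Adam with bias correction on a hypothetical quadratic, using a very small learning rate so the gradient barely changes over the averaging window. In that regime the first moment tracks g and the second moment tracks g^2, so the normalized update stays close to sign(g). The function name adam_directions and the test objective are assumptions made for the example.

```python
import math
from typing import Callable, List, Tuple


def adam_directions(grad_fn: Callable[[float], float], x0: float, lr: float,
                    beta1: float = 0.9, beta2: float = 0.999,
                    eps: float = 1e-12, steps: int = 200) -> List[Tuple[float, float]]:
    """Run scalar Adam and record (update direction, raw gradient) per step.

    The update direction is m_hat / (sqrt(v_hat) + eps), i.e. the quantity
    that Adam multiplies by the learning rate.
    """
    x, m, v = x0, 0.0, 0.0
    trace = []
    for t in range(1, steps + 1):
        g = grad_fn(x)
        m = beta1 * m + (1.0 - beta1) * g
        v = beta2 * v + (1.0 - beta2) * g * g
        m_hat = m / (1.0 - beta1 ** t)  # bias-corrected first moment
        v_hat = v / (1.0 - beta2 ** t)  # bias-corrected second moment
        d = m_hat / (math.sqrt(v_hat) + eps)
        trace.append((d, g))
        x -= lr * d
    return trace


if __name__ == "__main__":
    # Hypothetical objective f(x) = 0.5 * (x - 3)^2, so the gradient is x - 3.
    trace = adam_directions(lambda x: x - 3.0, x0=0.0, lr=1e-6)
    worst = max(abs(d - math.copysign(1.0, g)) for d, g in trace)
    print("largest deviation from sign(g):", worst)
```

With a larger learning rate the iterate moves appreciably within the averaging window and the direction can drift away from the pure sign, which is consistent with the abstract's focus on distinct early and late training phases.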

algorithm, artificial intelligence, machine learning, (16 more...)

arXiv.org Machine Learning

2009.06125

Country:

North America > United States > California > Santa Clara County > Palo Alto (0.04)
Asia > China > Beijing > Beijing (0.04)

Genre: Research Report (0.82)

Technology:

Information Technology > Artificial Intelligence > Machine Learning > Neural Networks (0.72)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Gradient Descent (0.35)
